High polygenic risk score is a risk factor associated with colorectal cancer based on data from the UK Biobank

Colorectal cancer (CRC) is a common cancer among both men and women and is one of the leading causes of cancer death worldwide. It is important to identify risk factors that may be used to help reduce morbidity and mortality of the disease. We used a case-control study design to explore the association between CRC, polygenic risk scores (PRS), and other factors. We extracted data about 2,585 CRC cases and 9,362 controls from the UK Biobank, calculated the PRS for these cases and controls based on 140 single nucleotide polymorphisms, and performed logistic regression analyses for the 11,947 cases and controls, for an older group (ages 50+), and for a younger group (younger than 50). Five significant risk factors were identified when all 11,947 cases and controls were considered. These factors were, in descending order of the values of the adjusted odds ratios (aOR), high PRS (aOR: 2.70, CI: 2.27–3.19), male sex (aOR: 1.52, CI: 1.39–1.66), unemployment (aOR: 1.47, CI: 1.17–1.85), family history of CRC (aOR: 1.44, CI: 1.28–1.62), and age (aOR: 1.01, CI: 1.01–1.02). These five risk factors also remained significant in the older group. For the younger group, only high PRS (aOR: 2.87, CI: 1.65–5.00) and family history of CRC (aOR: 1.73, CI: 1.12–2.67) were significant risk factors. These findings indicate that genetic risk for the disease is a significant risk factor for CRC even after adjusting for family history. Additional studies are needed to examine this association using larger samples and different population groups.


Data source
We used data from the UK Biobank, a population-based cohort study that collected blood samples from over 500,000 adults aged 40-70 years between 2006 and 2010, primarily across England, Scotland, and Wales. Samples were genotyped from blood-derived cells using two arrays that share 95% of their marker content: the UK BiLEVE Axiom (UKBL; 807,411 markers) and the UK Biobank Axiom (UKBB; 825,927 markers). Genotype imputation was performed using reference panels from the Haplotype Reference Consortium, UK10K, and 1000 Genomes phase 3. In the biobank, 487,409 samples had imputed genotyping data available for this study. In addition to genetic data, the biobank contains imaging data, health-related data, and sociodemographic and socioeconomic details for each participant. All participants were coded to protect their privacy.
In this study, we incorporated the following individual-level data from the UK Biobank: family history, age, sex, body mass index (BMI), index of multiple deprivation (IMD), current tobacco smoking status, maternal smoking around birth, alcohol intake frequency, qualifications of education, current employment status, number of vehicles in the household, and average total household income. Family history of CRC (whether the father, mother, or siblings had CRC) was also available for this study. BMI was measured in kg/m². The IMD score was derived from seven distinct domains: income deprivation, employment deprivation, health deprivation and disability, education skills and training deprivation, barriers to housing and services, living environment deprivation, and crime.
The income deprivation domain examines income-related deprivation by counting individuals with low income across five indicators, such as those in income support families or receiving income-based Jobseeker's allowances. The employment deprivation domain focuses on labor market exclusion, combining indicators like Jobseeker's allowance claimants and Incapacity Benefit claimants. The health deprivation and disability domain evaluates premature mortality and reduced quality of life due to poor physical and mental health. The education skills and training deprivation domain assesses educational disadvantages for both children and adults. The barriers to housing and services domain considers geographical and financial obstacles to accessing housing and local services. The living environment deprivation domain evaluates indoor and outdoor living conditions. Lastly, the crime domain examines the recorded crime rate, including violence, burglary, theft, and criminal damage, as a reflection of personal and material victimization risk at a local level.
We dichotomized the variables based on their nature (Table 1). For example, participants were classified into those with a family history (father, mother, or siblings) of CRC (Yes) or without one (No). Education level was categorized as either university or non-university education. The values of the other categorical variables listed in Table 1 were coded in the same way. We included only participants with complete records for all 13 variables listed in Table 1, so all 11,947 cases and controls in the analysis had complete records.

Study subjects
We selected the CRC cases from the Biobank based on ICD-10 codes C18.0-C18.9, C19, C20, and C26.0. Given that the majority of individuals in the genome-wide association study (GWAS) are of European ancestry, and considering the differences in linkage disequilibrium (LD), allele frequency, and gene-environment interaction between populations, we included only samples from participants who are White British and have complete imputed genotype information [24]. Individuals with genetic relationships closer than the second degree were excluded (kinship coefficient > 0.0884). Controls were selected from the remaining 349,660 participants who are White British and were not diagnosed with CRC. Controls were randomly selected to match cases within a 5-year age difference and with a residence location in the same output area (OA). Output areas are small geographic areas constructed by aggregating postcode areas. The final dataset used in the analyses contained 2,585 cases and 9,362 controls.
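The matching rule described above (same output area, age within 5 years, random selection) can be illustrated with a minimal sketch. The number of controls drawn per case (`per_case`), the use of each control at most once, and all record field names are assumptions made for illustration; the text does not specify them.

```python
import random


def match_controls(cases, pool, per_case=4, max_age_diff=5, seed=0):
    """Randomly pick up to `per_case` controls for each case.

    A control qualifies if it lives in the same output area as the case and
    its age differs from the case's age by at most `max_age_diff` years.
    Each control is used at most once (an assumption of this sketch).
    """
    rng = random.Random(seed)
    candidates = list(pool)
    rng.shuffle(candidates)  # a random order makes the selection random
    used = set()
    matched = {}
    for case in cases:
        picks = []
        for ctrl in candidates:
            if len(picks) == per_case:
                break
            if ctrl["id"] in used:
                continue
            same_area = ctrl["output_area"] == case["output_area"]
            close_age = abs(ctrl["age"] - case["age"]) <= max_age_diff
            if same_area and close_age:
                picks.append(ctrl["id"])
                used.add(ctrl["id"])
        matched[case["id"]] = picks
    return matched


# Hypothetical toy records; the field names are illustrative only.
cases = [{"id": "case1", "age": 60, "output_area": "OA_A"}]
pool = [
    {"id": "ctrl1", "age": 58, "output_area": "OA_A"},  # eligible
    {"id": "ctrl2", "age": 70, "output_area": "OA_A"},  # age gap too large
    {"id": "ctrl3", "age": 61, "output_area": "OA_B"},  # different output area
    {"id": "ctrl4", "age": 64, "output_area": "OA_A"},  # eligible
]
matched = match_controls(cases, pool)
```

In this toy example only `ctrl1` and `ctrl4` satisfy both criteria, so they are the controls assigned to `case1`.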
Ethics approval was not required for this study because UK Biobank data is open to all researchers, and the data has been de-identified. We did not have access to any information that could identify individual participants during or after data collection.

Polygenic risk score calculation
We calculated the PRS using 140 risk single nucleotide polymorphisms (SNPs) identified in a case-control study of CRC conducted by Thomas et al. [23]. That study used blood-derived genetic sequence information from all patients. The list of risk SNPs and their corresponding effect sizes on the risk of CRC can be found in the study of Thomas et al. [23]. One SNP on chromosome 13 (rs377429877) was missing from the imputed genotype data and was therefore excluded from the analyses. The SNPs in UK Biobank were imputed using the Haplotype Reference Consortium panel, with directly genotyped SNPs coded as 0, 1, or 2 copies of the risk allele, while imputed SNPs were coded as imputed dosages, indicating the expected number of risk allele copies. In general, we first extracted all the risk SNPs from the imputed genotyping data for each CRC case and control and then calculated the PRS as the sum of risk alleles of the respective variants (imputed dosages for imputed SNPs; 0, 1, or 2 copies of the risk allele for genotyped SNPs). We used a scoring function in the PLINK 2.0 software [25] to calculate the PRS based on the imputed genotyping data in the UK Biobank. Following the method used by Jia et al., we assigned individuals with a PRS in the top 5% to the high-risk group and all other individuals to the low-risk group [26].
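To make the scoring and the top-5% dichotomization concrete, the following is a minimal sketch and not the PLINK 2.0 procedure itself. It assumes each SNP is weighted by a per-allele effect size such as a log odds ratio; setting every weight to 1 gives a plain risk-allele count. The SNP identifiers, weights, and function names are hypothetical.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages over the SNPs in `weights`.

    `dosages` maps SNP ID -> 0, 1, or 2 copies (genotyped SNPs) or a
    fractional imputed dosage; `weights` maps SNP ID -> effect size.
    """
    return sum(weight * dosages[snp] for snp, weight in weights.items())


def flag_high_risk(scores, top_fraction=0.05):
    """Return True for individuals whose score is in the top `top_fraction`."""
    n_high = max(1, round(len(scores) * top_fraction))
    cutoff = sorted(scores, reverse=True)[n_high - 1]
    return [score >= cutoff for score in scores]


# Hypothetical effect sizes for two made-up SNP IDs.
weights = {"snp_a": 0.1, "snp_b": 0.2}
# One genotyped SNP (2 copies) and one imputed SNP (dosage 1.5).
person = {"snp_a": 2, "snp_b": 1.5}
prs = polygenic_risk_score(person, weights)  # 0.1 * 2 + 0.2 * 1.5

# Toy cohort of 20 scores: the top 5% is the single highest score.
cohort_scores = [float(i) for i in range(20)]
high = flag_high_risk(cohort_scores)
```

The rank-based cutoff is one of several reasonable ways to define "top 5%"; ties at the cutoff would all be flagged as high risk in this sketch.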

Statistical analysis
We computed the odds ratios (OR) using logistic regression analysis based on a case-control study design for the cases and controls in the final dataset. When performing the logistic regression analyses, we conducted univariate analysis to explore the impact of each variable on CRC individually. Additionally, we conducted multivariate analysis with all variables included in the model and compared the results with the univariate model. To examine how the results would differ when family history is excluded from the analysis and when only participants with a top 5% or middle 41-60% PRS are considered, we repeated the same statistical analysis on two sub-datasets selected from the current dataset: one comprising 10,142 participants without a family history of CRC, and the other consisting of participants with a top 5% or middle 41-60% PRS.
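For a single binary predictor, the odds ratio from a univariate logistic regression equals the cross-product ratio of the 2x2 table, and its Wald confidence interval follows from the four cell counts. The sketch below illustrates this for high versus low PRS using the counts reported in the Results (251 and 2,334 cases; 355 and 9,007 controls). It is an unadjusted illustration only, not the multivariate model used in the study, and the function name is hypothetical.

```python
import math


def odds_ratio_with_ci(exposed_cases, unexposed_cases,
                       exposed_controls, unexposed_controls, z=1.96):
    """Unadjusted odds ratio and Wald 95% CI from a 2x2 table."""
    odds_ratio = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls
    )
    # Standard error of log(OR): square root of the summed reciprocal counts.
    se_log_or = math.sqrt(
        1 / exposed_cases
        + 1 / unexposed_cases
        + 1 / exposed_controls
        + 1 / unexposed_controls
    )
    log_or = math.log(odds_ratio)
    return (
        odds_ratio,
        math.exp(log_or - z * se_log_or),
        math.exp(log_or + z * se_log_or),
    )


# High-PRS vs low-PRS counts among cases and controls.
or_high_prs, ci_low, ci_high = odds_ratio_with_ci(251, 2334, 355, 9007)
```

The unadjusted estimate is about 2.73, close to the adjusted odds ratio of 2.70 reported for high PRS in the full model.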

Results
Among the 11,947 participants with complete data used in this study, slightly more than half (6,067; 50.8%) were female, and the proportion was higher in the younger group (54.6%). The BMI values ranged from 15.27 to 54.52, with a standard deviation of 4.5; the mean and median BMI were 27.2 and 26.7, respectively. The IMD values ranged from 0.82 to 81.07, with a standard deviation of 12.1; the mean and median IMD were 14.5 and 10.8, respectively. Compared to the younger group, the older group had a slightly higher proportion of participants with less than a university education (59.8% vs 57.8%) and with a family history of CRC (15.3% vs 12.8%), and a considerably higher proportion of participants who drank daily (26.5% vs 17.0%) and had household incomes below the poverty level (20.2% vs 8.5%) (Table 1). Conversely, the younger group had a slightly higher proportion of participants who were unemployed (4.2% vs 3.4%), had a high PRS (6.0% vs 5.0%), had maternal smoking around birth (30.4% vs 28.3%), and were active smokers (10.7% vs 7.1%).
A brief examination of the data indicates that: (1) among the 2,585 CRC cases, 251 participants (9.7%) had a high PRS, and 2,334 participants (90.3%) had a low PRS; (2) among the 9,362 controls, 355 participants (3.8%) had a high PRS, and 9,007 participants (96.2%) had a low PRS. A higher proportion of participants with a high PRS was observed in the case group compared to that in the control group. A two-proportions z-test (α = 0.05) indicates that the difference between these two observed proportions is significant (p-value < 2.2e-16).
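The two-proportions z-test can be reproduced from the counts above. The sketch below assumes a pooled standard error without continuity correction and a two-sided alternative; the text does not say which variant of the test was used, so the exact statistic may differ slightly from the authors' software output.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for equality of two proportions (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# 251 of 2,585 cases vs 355 of 9,362 controls had a high PRS.
z_stat, p_value = two_proportion_z_test(251, 2585, 355, 9362)
```

Under these assumptions the z statistic is a little above 12, and the p-value is far below the reported bound of 2.2e-16.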
To better understand the association between PRS and CRC, we analyzed a sub-dataset from the original case-control dataset, which included 10,142 participants without a family history of CRC. Among these participants, 9,287 (1,919 cases and 7,368 controls) are in the older group, and 855 (161 cases and 694 controls) are in the younger group. Tables 5 and 6 summarize the analysis results related to this sub-dataset. The results indicated that, for participants without a family history of CRC, the risk of developing CRC for those with a high PRS was 2.90 times (aOR: 2.90, CI: 2.40-3.50) that of those with a low PRS (Table 5). Age (aOR: 1.02, CI: 1.01-1.02), sex (aOR: 1.42, CI: 1.28-1.57), and employment status (aOR: 1.61, CI: 1.26-2.07) remained significant risk factors associated with CRC, consistent with the results in Table 2. The ORs changed only slightly compared to the analysis in which family history was included as a factor. The risk of developing CRC was even higher for participants younger than 50 with a high PRS: 3.65 times that of those with a low PRS (aOR: 3.65, CI: 1.95-6.84) (Table 6).
Furthermore, to allow comparison with prior findings, we analyzed a sub-dataset extracted from the original case-control dataset that included participants with a top 5% or middle 41-60% PRS, both with and without a family history of CRC. The sub-dataset that consisted of individuals with a family history of CRC comprised 2,988 participants, with 729 cases and 2,259 controls. The sub-dataset without a family history of CRC comprised 2,537 participants, with 590 cases and 1,947 controls. In the older group with a family history of CRC, there were 667 cases and 2,079 controls, whereas in the older group without a family history of CRC, there were 541 cases and 1,791 controls. Tables 7 and 8 summarize the outcomes for the selected participants at all ages. Supplementary tables (S1 and S2 Tables) summarize the outcomes for the selected older participants. The results demonstrate that individuals with a high PRS had a two to three times greater risk (aOR: 2.86, CI: 2.36-3.47; aOR: 3.01, CI: 2.43-3.71) of developing CRC than those with a middle 41-60% PRS, regardless of their family history of CRC (Tables 7 and 8). These findings are consistent with the previous results that categorized the PRS into high and low groups.

Discussion and conclusion
Findings from this study suggest that a high PRS is a potential risk factor associated with CRC, regardless of whether individuals are older or younger than 50. In addition, results from this study indicate that, among people younger than 50 without a family history of CRC, the risk of developing CRC for those with a PRS in the top 5% is 3.65 times that of those whose PRS falls within the other 95%. This relative risk is higher than that for people with a family history of CRC compared to those without a family history of CRC. It is worth noting that high PRS had a higher odds ratio than family history of CRC in all logistic regression analyses. These findings have implications for the implementation of CRC screening programs aimed at preventing CRC or detecting it at an early stage. Additional research is needed to evaluate the findings, and we recommend that individuals with a high PRS consider participating in CRC screening, even if they do not have a family history of CRC. In addition, our results demonstrate that while family history captures some form of genetic disease risk, additional information from the PRS adds to risk stratification.
Previous studies have suggested that PRS is associated with CRC and has a stronger impact on early-onset CRC. Archambault et al. used 95 CRC-associated SNPs to study whether a PRS was associated with the risk of early-onset CRC [21]. Their results showed that PRS was significantly associated with early-onset CRC, and the association was stronger than that for CRC in people older than 50 years. Mur et al. divided a weighted 92-variant PRS into 20 quantiles to assess the contribution of PRS to family history of CRC and early-onset CRC [27]. In their study, CRC patients in the highest weighted PRS quantile (the 20th quantile, i.e., the top 5% of weighted PRS) had a four-fold greater risk of developing CRC compared to those in the reference quantile (the 10th quantile, i.e., the middle 46%-50% of weighted PRS). Jia et al. used risk variants to identify high-risk individuals for eight common cancers. The results showed that individuals with the highest 5% PRS had a two- to three-fold elevated risk of developing CRC [26]. Ping et al. developed and validated a PRS for CRC risk prediction in East Asians. Individuals within the top 5% of PRS had a 2.52-fold elevated CRC risk compared to those in the medium (41-60%) risk group [28]. Those results are consistent with the finding in this study that participants with a PRS in the top 5% had a two to three times higher risk of developing CRC compared to those whose PRS is not in the top 5%.
Other previous studies have examined the association between PRS and CRC along with various risk factors, including lifestyle [29,30], physical activity [31], consumption of red and processed meat [32], alcohol intake [33], smoking [34], frequency of colonoscopy [35,36], and the use of non-steroidal anti-inflammatory drugs [37]. However, these previous studies primarily focused on PRS in isolation or in combination with just one additional relevant factor in their analyses. More comprehensive studies that incorporate PRS along with several risk factors are needed. Ibáñez-Sanz et al. developed a model to identify CRC risk in a Spanish population by using 21 CRC-associated SNPs and incorporating environmental data such as lifestyle factors as well as family and medical history in their analysis [38]. The results from that study indicated that alcohol consumption, obesity, physical activity, red meat and vegetable consumption, and nonsteroidal anti-inflammatory drug use were associated with the risk of developing CRC. These researchers suggested that family history of CRC and risk SNPs are also factors leading to a higher risk of developing CRC. These results support the findings from our study that participants with alcohol intake and a family history of CRC experienced an elevated risk of CRC. Although the study by Ibáñez-Sanz et al. considered the impact of multiple factors on CRC, the authors simply counted the risk alleles across all 21 SNPs to represent the genetic risk. This approach is limited because it does not consider the effect sizes of the SNPs. The PRS used in our study accounts for the effect sizes of SNPs.
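The difference between counting risk alleles and weighting them by effect size can be written out explicitly. The notation below is introduced here for illustration and is not taken from the cited studies: G_ij is the number (or imputed dosage) of risk alleles that individual i carries at SNP j, m is the number of SNPs, and β_j is the per-allele effect size of SNP j, assumed here to be a log odds ratio.

```latex
\mathrm{PRS}^{\text{count}}_i = \sum_{j=1}^{m} G_{ij},
\qquad
\mathrm{PRS}^{\text{weighted}}_i = \sum_{j=1}^{m} \beta_j \, G_{ij}
```

The two scores coincide, up to a constant factor, only when all β_j are equal; otherwise the weighted score gives more influence to SNPs with larger effects.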
Studies have also evaluated whether a healthy lifestyle can offset increased genetic risk in CRC [39,40]. Healthy lifestyle scores were constructed from a number of lifestyle factors and were categorized into unhealthy (unfavorable), intermediate, and healthy (favorable) groups. However, these studies considered lifestyle factors as a whole, and it is not clear which underlying factor was associated with the development of CRC. Unlike these studies, we included multiple lifestyle factors as well as other factors in the analysis and explored their association with CRC individually instead of using a composite lifestyle score.
A strength of our study is that it combines PRS with several other relevant factors, in contrast to previous studies reported in the literature. These factors included sociodemographic, socioeconomic, and lifestyle factors and family history of CRC. Findings from this study hence fill some gaps in the literature. A limitation of our study is the small sample size for participants younger than 50 years old. The limited number of young participants may account for the observation that only family history and PRS remain statistically significant risk factors in the younger age group, while sex, employment status, and age lose their significance. Further investigation is warranted to validate these findings once a larger dataset of younger participants becomes available. The results are limited to White British participants only, as the PRS calculations are based on risk SNPs identified in previous GWAS that primarily involved individuals of European ancestry. Future research should examine whether the results hold in other population groups, based on large sample sizes.

54 (1.24-1.92) < 0.001 1.47 (1.17-1.85) 0.001
OR, odds ratio; aOR, adjusted odds ratio; -, not applicable. a P value calculated by univariate logistic regression; significant at P < 0.05. b P value calculated by multivariate logistic regression; significant at P < 0.05. * For the univariate regression model, only one variable was included in each model. * For the multivariate regression model, all 13 variables listed in the table were included. These variables are age, BMI, IMD, sex, PRS, family history, current tobacco smoking, alcohol intake frequency, household income, number of vehicles in the household, maternal smoking around birth, education, and employment.

53 (1.22-1.93) < 0.001 1.49 (1.17-1.89) 0.001
OR, odds ratio; aOR, adjusted odds ratio; -, not applicable. a P value calculated by univariate logistic regression; significant at P < 0.05. b P value calculated by multivariate logistic regression; significant at P < 0.05. * For the univariate regression model, only one variable was included in each model. * For the multivariate regression model, all 13 variables listed in the table were included. These variables are age, BMI, IMD, sex, PRS, family history, current tobacco smoking, alcohol intake frequency, household income, number of vehicles in the household, maternal smoking around birth, education, and employment.

Table 4. Results of logistic regression analysis: The younger group (<50 years old; 198 cases and 783 controls).
OR, odds ratio; aOR, adjusted odds ratio; -, not applicable. a P value calculated by univariate logistic regression; significant at P < 0.05. b P value calculated by multivariate logistic regression; significant at P < 0.05. * For the univariate regression model, only one variable was included in each model. * For the multivariate regression model, all 13 variables listed in the table were included. These variables are age, BMI, IMD, sex, PRS, family history, current tobacco smoking, alcohol intake frequency, household income, number of vehicles in the household, maternal smoking around birth, education, and employment.

Table 5. Results of logistic regression analysis of the 10,142 participants without family history of CRC (2,080 cases and 8,062 controls).
OR, odds ratio; aOR, adjusted odds ratio; -, not applicable. a P value calculated by univariate logistic regression; significant at P < 0.05. b P value calculated by multivariate logistic regression; significant at P < 0.05. * For the univariate regression model, only one variable was included in each model. * For the multivariate regression model, all 12 variables listed in the table were included. These variables are age, BMI, IMD, sex, PRS, current tobacco smoking, alcohol intake frequency, household income, number of vehicles in the household, maternal smoking around birth, education, and employment.
https://doi.org/10.1371/journal.pone.0295155.t005

Table 6. Results of logistic regression analysis of the 855 participants without family history of CRC in the younger group (161 cases and 694 controls).
a P value calculated by univariate logistic regression; significant at P < 0.05. b P value calculated by multivariate logistic regression; significant at P < 0.05. * For the univariate regression model, only one variable was included in each model. * For the multivariate regression model, all 12 variables listed in the table were included. These variables are age, BMI, IMD, sex, PRS, current tobacco smoking, alcohol intake frequency, household income, number of vehicles in the household, maternal smoking around birth, education, and employment.
https://doi.org/10.1371/journal.pone.0295155.t006

75 (1.16-2.62) 0.007 1.75 (1.13-2.70) 0.011
P value calculated by multivariate logistic regression; significant at P < 0.05. For the univariate regression model, only one variable was included in each model. For the multivariate regression model, all 13 variables listed in the table were included. These variables are age, BMI, IMD, sex, PRS, family history, current tobacco smoking, alcohol intake frequency, household income, number of vehicles in the household, maternal smoking around birth, education, and employment.

75 (1.11-2.74) 0.016 1.73 (1.06-2.82) 0.027
OR, odds ratio; aOR, adjusted odds ratio; -, not applicable. a P value calculated by univariate logistic regression; significant at P < 0.05. b P value calculated by multivariate logistic regression; significant at P < 0.05. * For the univariate regression model, only one variable was included in each model. * For the multivariate regression model, all 12 variables listed in the table were included. These variables are age, BMI, IMD, sex, PRS, current tobacco smoking, alcohol intake frequency, household income, number of vehicles in the household, maternal smoking around birth, education, and employment.